Wafer map failure pattern classification using geometric transformation-invariant convolutional neural network

Wafer map defect pattern classification is essential in semiconductor manufacturing processes for increasing production yield and quality by providing key root-cause information. However, manual diagnosis by field experts is difficult in large-scale production situations, and existing deep-learning frameworks require a large quantity of data for learning. To address this, we propose a novel rotation- and flip-invariant method based on the labeling rule that rotating or flipping a wafer map defect pattern does not change its label, achieving class-discriminant performance in scarce data situations. The method utilizes a convolutional neural network (CNN) backbone with a Radon transformation and kernel flip to achieve geometrical invariance. The Radon feature serves as a rotation-equivariant bridge for translation-invariant CNNs, while the kernel flip module enables the model to be flip-invariant. We validated our method through extensive qualitative and quantitative experiments. For qualitative analysis, we suggest a multi-branch layer-wise relevance propagation to properly explain the model decision. For quantitative analysis, the superiority of the proposed method was validated with an ablation study. In addition, we verified the generalization performance of the proposed method with respect to rotation and flip invariance for out-of-distribution data using rotation- and flip-augmented test sets.

Bayesian method. Wu et al. 5 proposed a support vector machine (SVM) based method using a set of Radon and scale-invariant features. They demonstrated that Radon-based features can be used to acquire a rotation-equivariant response. Yu and Lu 6 proposed the use of joint local and non-local linear discriminant analyses for wafer map defect detection and recognition based on multiple features, including geometric and Radon features. Saqlain et al. 7 proposed a voting ensemble classifier using various features, including Radon features. These methods have actively explored models that employ useful features derived from domain knowledge; however, their inference performance is limited by the shallowness of machine learning-based models.
As the depth of inference models has increased with the development of computational resources, deep learning-based methods have been actively studied for wafer defect pattern classification because they can automatically learn meaningful features from raw data without expert intervention, enabling improved pattern classification performance. Research on deep learning-based methods has proceeded in two stages: first, deep learning frameworks were simply applied to the wafer map defect pattern problem; second, practical concerns, such as data scarcity and memory efficiency, were addressed. Regarding the former, early research adopted convolutional neural network (CNN) models, which show exceptional performance among deep learning models in image classification, for wafer map classification 8,9. Kyeong et al. 10 proposed classifying mixed-type defect patterns in wafer bin maps using multiple CNN models. Yu et al. 11 proposed a two-stage approach for recognizing and classifying wafer map patterns. However, obtaining sufficient clean, high-quality labeled wafer map data is often a constraint throughout the manufacturing process; therefore, a model that adds further approaches to the traditional CNN model is required. Regarding the latter, several studies have proposed models based on the fact that the label remains unaffected by rotation and flip, according to the predefined labeling rule of the wafer map. Kang et al. 12 proposed a data augmentation method to learn rotation- and flip-invariant representations through augmentation along discrete angle directions. Kahng et al. 13 proposed self-supervised learning for pretext-invariant representation, which includes rotation invariance in the data-augmentation context. As a result, it was possible to achieve high classification performance in limited data situations.
However, these previously proposed methods are limited because they do not directly incorporate rotation and flip invariance into the model architecture; the model's ability to recognize these invariances is not built into its design. Instead, these methods rely on data augmentation and additional parameters, which can be inefficient and insufficient for addressing memory efficiency concerns. This limitation has already been noted for rotation-invariant CNNs in the field of computer vision, as discussed in "Related works".
In this paper, we propose a novel method for classifying wafer defect patterns that is invariant to rotation and flip. Considering the orientation variations in wafer defect patterns due to manufacturing processes and equipment, achieving rotation and flip invariance becomes crucial for accurate and robust classification. Furthermore, by incorporating these invariances into the classification method, our approach can efficiently extract relevant features from limited data, helping to mitigate data scarcity issues. To achieve rotation invariance, we utilize the rotation-equivariant property of Radon features, hand-crafted features previously used in machine learning, within the CNN framework. Moreover, we achieve flip invariance by designing kernels within the network, minimizing the reliance on data augmentation. To validate our model, we conduct both qualitative and quantitative analyses. For qualitative analysis, we introduce the multi-branch layer-wise relevance propagation (multi-branch LRP) method to interpret the model decisions, specifically designed for models with multi-branch structures like our kernel flip module. We demonstrate the individual impact of the Radon transformation and kernel flip through both qualitative and quantitative evaluations using an ablation study. We also evaluate our model's generalization performance on unseen rotation- and flip-augmented datasets.

Background and preliminaries
Related works. CNNs inherently possess a strong capability to learn translation-invariant features through translational weight sharing and pooling operations. However, achieving other forms of spatial invariance, such as rotation and flip, remains a limitation of the CNN framework. Numerous studies have been conducted to address these challenges by (1) augmenting features of an input image with several transformed copies, and (2) encoding desired transformation invariance for the CNN using specific trainable modules within the network.
The former can be broken down into input data augmentation and feature augmentation by the inner filters of the network. In many early studies, the input data were directly augmented for various applications. Laptev et al. 14 proposed a transformation-invariant pooling (TI-pooling) layer that passes the most highly activated transformation-invariant features to the fully connected layer by max-pooling over features extracted by a weight-shared CNN from each rotationally augmented copy of the input. Cheng et al. 15 proposed a similar method, rotation-invariant CNN (RICNN), which trains existing CNNs by rotationally augmenting training samples for the object detection task. Cheng et al. 16 proposed a rotation-invariant and Fisher discriminative CNN (RIFD-CNN), which uses the same data augmentation strategy as RICNN but adds a Fisher discriminative layer. However, directly augmenting input data has a critical limitation: it fundamentally requires more memory and network capacity to generalize across rotations. Because of this, feature augmentation by internal filters of the network has lately gained considerable attention in a variety of methods. Dieleman et al. 17 proposed a multiple-branch CNN structure for extracting different viewpoints for each augmented image. Then, Dieleman 18 extended this concept by performing various operations on cyclic symmetries. Cohen et al. 19 proposed a group-equivariant CNN based on group theory, utilizing a symmetry group and a pooling operation on the group. Marcos et al. 20 suggested explicitly incorporating rotation invariance into the model by associating the weights of groups of filters with various rotated copies of the group's canonical filter. Gao et al. 21 proposed a set of kernel rotation and flip methods for achieving rotation and flip invariance in a CNN.
In summary, the feature augmentation method follows the structure of sampling multiple branches for data variation within the network, and its main limitation is the trade-off between generalizing over the data variation and the number of branches. The second line of work utilizes certain trainable modules inside a CNN to encode the required transformation invariance. Worrall et al. 22 proposed harmonic networks that achieve rotation invariance by replacing regular CNN filters with circular harmonics, thus returning a maximal response and orientation. Jaderberg et al. 23 proposed the spatial transformer network (STN), which uses learnable modules that explicitly allow the spatial manipulation of input data to reduce pose variations in subsequent layers within the network. Esteves et al. 24 suggested a polar transformer network (PTN), which is an extended version of the STN combining canonical coordinate representations. Dai et al. 25 proposed a deformable CNN with deformable convolution and RoI pooling based on the idea of augmenting the spatial sampling locations in the modules. These works are constrained in that they not only require additional trainable parameters for the additional modules but also require a complex structure to adapt to a CNN.
In this study, we propose a novel rotation- and flip-invariant CNN approach for classifying wafer map defect patterns, taking into consideration the challenge of data scarcity. To achieve this, we suggest incorporating hand-crafted features into a deep learning framework. Specifically, we utilize the rotation-equivariant property of the Radon feature, a hand-crafted feature commonly used in previous machine learning work on wafer classification, to obtain rotation invariance in the CNN framework. Furthermore, we achieve flip invariance by introducing a kernel flip module with only a two-branched structure, which learns the data variation of flipped copies produced by each branch. It is worth noting that our method achieves flip invariance in all directions by combining it with rotation invariance, utilizing the rotation-equivariant feature and a minimal number of flipped-kernel branches. This approach allows for more compact and efficient representations, potentially leading to better performance and reduced training times compared to data augmentation-based methods.

Equivariance and invariance.
To facilitate understanding of the problem statement, it is essential to first comprehend the concepts of equivariance and invariance. Given a mapping function $\Phi$, an input $X$ from a set of inputs $\{X_i\}$, and a group $G$, we call $\Phi$ equivariant under $T_1 \in G$ if the transformation of the input is related to a transformation $T_2 \in G$ of the output, as stated in Eq. (1):
$$\Phi(T_1 \cdot X) = T_2 \cdot \Phi(X) \tag{1}$$
Conversely, $\Phi$ is invariant under $T \in G$ if its output is unaffected by the transformation of the input, as expressed in Eq. (2):
$$\Phi(T \cdot X) = \Phi(X) \tag{2}$$
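The two definitions can be checked numerically on a toy example. The following is a minimal sketch, not part of the original method: it assumes a horizontal flip as the transformation and uses two hypothetical mappings, an element-wise square (equivariant) and a global sum (invariant).

```python
import numpy as np

def flip(x):
    """Horizontal flip of a 2D array (plays the role of the transformation T)."""
    return x[:, ::-1]

def square(x):
    """Element-wise map: equivariant under flip, with T2 equal to T1."""
    return x ** 2

def total(x):
    """Global sum: invariant under flip."""
    return x.sum()

x = np.arange(12, dtype=float).reshape(3, 4)

# Equivariance, Eq. (1): transforming the input transforms the output.
equivariant = np.array_equal(square(flip(x)), flip(square(x)))

# Invariance, Eq. (2): transforming the input leaves the output unchanged.
invariant = total(flip(x)) == total(x)
```

Both flags evaluate to true here because the square acts pixel by pixel, while the sum discards spatial positions altogether.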

Problem formulation.
To clearly explain the proposed mechanism for obtaining rotation and flip invariance, we formulate the principle of the proposed approach, including the Radon transform, kernel flip, and CNN backbone modules. The wafer defect pattern image data and its label set are denoted as $(X_i, y_i)$; the geometrical transformations are denoted as translation $T_T$, rotation $T_R$, and flip $T_F$; and the groups of these transformations are denoted as $G_T$, $G_R$, and $G_F$, respectively. The labeling rule function $\Phi_{\text{label}}$ is given according to Eq. (3) when $T = T_R \cdot T_F = T_F \cdot T_R$ in $G_R \cup G_F$, where $T_R \cdot T_F$ represents the function composition of $T_R$ and $T_F$, and our objective is to build a model that approximates this function:
$$\Phi_{\text{label}}(T \cdot X_i) = \Phi_{\text{label}}(X_i) = y_i \tag{3}$$
The CNN model $\Phi_{\text{CNN}}$ we use for label inference has the inherent ability to learn translation-invariant features, exhibiting the following characteristic:
$$\Phi_{\text{CNN}}(T_T \cdot X_i) = \Phi_{\text{CNN}}(X_i) \tag{4}$$
However, the CNN model is not rotation-invariant:
$$\Phi_{\text{CNN}}(T_R \cdot X_i) \neq \Phi_{\text{CNN}}(X_i) \tag{5}$$
To provide some context for Eq. (5), let $T_R \cdot X_i$ represent the application of the rotation transformation $T_R$ to the input $X_i$. With this understanding, we can now explain that our model uses the rotation-equivariant mapping function Radon transform ($\Phi_{\text{Radon}}$) as an intermediate step to address the lack of rotation invariance in the CNN model.
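As a result, the CNN applied to the Radon feature is invariant to rotation of the input:
$$\Phi_{\text{CNN}}(\Phi_{\text{Radon}}(T_R \cdot X_i)) = \Phi_{\text{CNN}}(T_T \cdot \Phi_{\text{Radon}}(X_i)) = \Phi_{\text{CNN}}(\Phi_{\text{Radon}}(X_i))$$
For our proposed model, we aim to achieve both rotation and flip invariance. To address the lack of flip invariance, we incorporate the kernel flip (KF) module into the CNN architecture. The flip symmetry of the wafer map is preserved here by changing the flip axis by $\pi/2$ to account for the Radon feature effect.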
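The rotation-to-translation principle can be illustrated with a heavily simplified stand-in for the Radon transform. The sketch below is an assumption-laden toy, not the paper's implementation: `toy_sinogram` uses only four projection angles (multiples of 90 degrees, so no interpolation is needed), and `shift_invariant_feature` replaces the translation-invariant CNN with a plain max over the angle axis. Both names are hypothetical.

```python
import numpy as np

def toy_sinogram(image):
    """Project the image onto the column axis at 0, 90, 180 and 270 degrees.

    Row k holds the projection of the image rotated by k * 90 degrees, so the
    row index plays the role of the projection angle theta.
    """
    return np.stack([np.rot90(image, k).sum(axis=0) for k in range(4)])

def shift_invariant_feature(sinogram):
    """Stand-in for a translation-invariant CNN: pool over the angle axis."""
    return sinogram.max(axis=0)

rng = np.random.default_rng(0)
wafer = (rng.random((8, 8)) > 0.8).astype(float)  # hypothetical binary defect map
rotated = np.rot90(wafer, 1)                      # T_R: rotate by 90 degrees

s_orig = toy_sinogram(wafer)
s_rot = toy_sinogram(rotated)

# Rotation of the wafer map becomes a cyclic translation along the angle axis.
is_shift = np.array_equal(s_rot, np.roll(s_orig, -1, axis=0))

# Pooling over the translated axis removes the shift: rotation invariance.
is_invariant = np.array_equal(shift_invariant_feature(s_rot),
                              shift_invariant_feature(s_orig))
```

A real sinogram samples many more angles, and the invariance then comes from the convolution and pooling layers of the backbone rather than from a single max, but the chain of reasoning is the same.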

Methodology
Proposed framework. The proposed rotation- and flip-invariant representation learning method comprises two main modules and a CNN backbone, as illustrated in Fig. 1. Initially, the Radon module transforms wafer maps into tomography images, converting rotation to translation. Subsequently, a flipped feature set is obtained through two branches of kernel flip operations. By employing the max-out operation to select the highly activated features among the pair of flipped feature sets, the backbone CNN, often referred to as translation-invariant due to its capability of acquiring translation-invariant features, learns a discriminative representation that captures the wafer label characteristics through rotation-equivariant and flip-equivariant features.
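The two-branch kernel flip with max-out described above can be sketched on a single channel. This is an illustrative sketch under stated assumptions: a single 3 x 3 kernel, a horizontal flip axis, and a hand-written valid cross-correlation; the helper names `correlate2d_valid` and `kernel_flip_maxout` are hypothetical and do not come from the original work.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Plain 'valid' 2D cross-correlation, as used in CNN layers."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def kernel_flip_maxout(x, k):
    """Two branches (kernel and its horizontally flipped copy) plus max-out."""
    branch_a = correlate2d_valid(x, k)
    branch_b = correlate2d_valid(x, k[:, ::-1])
    return np.maximum(branch_a, branch_b)

rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=(6, 7)).astype(float)   # toy input feature map
k = rng.integers(-2, 3, size=(3, 3)).astype(float)  # toy kernel

y = kernel_flip_maxout(x, k)
y_flipped_input = kernel_flip_maxout(x[:, ::-1], k)

# Flipping the input flips the max-out feature map (flip equivariance) ...
is_equivariant = np.array_equal(y_flipped_input, y[:, ::-1])
# ... so a global pooling step afterwards yields a flip-invariant value.
is_invariant = y_flipped_input.max() == y.max()
```

The key observation is that correlating a flipped input with a kernel equals the flipped response of the flipped kernel, so the element-wise maximum over the two branches sees the same pair of responses in either case.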

Radon transformation.
Our proposed method adopts the Radon feature as the input representation due to its rotation-equivariant characteristic with respect to the wafer map. The Radon transformation is a forward projection that produces a sinusoidal tomography $P_\theta(r)$ by projecting the image at each rotation angle $\theta$. When $f(x, y)$ is the original image, the Radon transform function is given as
$$P_\theta(r) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,\delta(x\cos\theta + y\sin\theta - r)\,dx\,dy$$
The above projection converts a rotation of the original image into a translation of the Radon feature. By comparing the first rows of Fig. 2a, b, we can recognize that the original wafer map's rotation corresponds to the Radon feature's translation. As a result, the Radon transform functions as a rotation-equivariant bridge, enabling the use of a translation-invariant CNN backbone model to obtain a rotation-invariant representation. Additionally, by comparing the second rows of Fig. 2a, b, we can see that the vertical flip on the wafer map corresponds to a horizontal flip on the Radon feature. This implies that flip equivariance with respect to the Radon feature inherently guarantees flip equivariance with respect to the wafer map, considering the $\pi/2$ change in the flip axis.
Multi-branch LRP. In this study, we adopted LRP to evaluate our method qualitatively, not only to relate the inference based on the Radon feature to the original wafer map-based prediction but also to verify that our proposed model works as intended. LRP is primarily used to understand model inference as an interpretability-based approach to deep learning-based models. Based on the deep Taylor decomposition method described by Eq. (13), the relevance score can be obtained from the output prediction,
$$f(x) = f(a) + \sum_i \left.\frac{\partial f}{\partial x_i}\right|_{x=a}(x_i - a_i) + \epsilon \tag{13}$$
where $a$ is a root point of the Taylor series and $\epsilon$ is a term substituted for the higher-order polynomial terms of the Taylor series. By sequentially repeating the relevance propagation to previous layers, the input layer's relevance scores can finally be obtained.
To apply this technique to our model, there is a structural difficulty: the relevance score cannot be propagated as-is because our model is a multi-branch model. To the best of our knowledge, the LRP method has not previously been used in a complicated structure such as a multi-branch CNN. Herein, we propose a novel LRP method for the multi-branch structure, as depicted in Fig. 1. When the relevance score arrives at the kernel flipping module, two relevance scores are generated after passing through each kernel. Propagating the separated relevance scores independently yields multiple relevance scores at the input layer that do not reflect the basis of the model decision. To solve this structural problem, we concatenate both relevance scores and both kernels along the channel axis. Then, we propagate the relevance through the concatenated relevance feature and kernel to generate a combined relevance score.
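The concatenation step can be sketched with a dense layer standing in for the convolution kernels. The code below is a hedged illustration, not the authors' implementation: it assumes the LRP-epsilon propagation rule, treats each branch as a small weight matrix, and uses the hypothetical helper `lrp_epsilon`; the max-out routing that decides how much relevance reaches each branch is not modeled.

```python
import numpy as np

def lrp_epsilon(x, weights, relevance_out, eps=1e-9):
    """LRP-epsilon rule for a linear layer z = weights @ x."""
    z = weights @ x                                   # pre-activations, shape (out,)
    s = relevance_out / (z + eps * np.where(z >= 0, 1.0, -1.0))
    return x * (weights.T @ s)                        # relevance per input, shape (in,)

rng = np.random.default_rng(0)
x = rng.random(5) + 0.1                   # shared input of both branches
w_a = rng.random((3, 5)) + 0.1            # branch A weights (original kernel)
w_b = w_a[:, ::-1].copy()                 # branch B weights (flipped kernel)
r_a = rng.random(3)                       # relevance arriving at branch A
r_b = rng.random(3)                       # relevance arriving at branch B

# Concatenate kernels and relevance scores along the output-channel axis,
# then propagate once to obtain a single combined input relevance map.
w_cat = np.concatenate([w_a, w_b], axis=0)
r_cat = np.concatenate([r_a, r_b], axis=0)
r_input = lrp_epsilon(x, w_cat, r_cat)
```

Under this rule the combined map conserves the total relevance of both branches, which is why a single heat map at the input can be read as the joint explanation of the two-branch module.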

Results and discussion
Experiment. Data description. In general, wafer map patterns are categorized into seven classes based on their cluster position and shape, each of which is associated with specific process conditions and effects 27 : center, donut, edge-loc, ring, loc, scratch, and random. For example, the center type results from problems in the plasma area 28 or thin-film deposition, and the edge-loc type results from uneven heating during the diffusion process. Therefore, classifying these patterns and determining the state of the process has been considered an important task, so that the cause of process deterioration can be estimated. Existing machine learning-based wafer sorting tasks have mainly been researched under two scenarios: individual fab data and open data 27 , each with pros and cons. Using private data is advantageous for optimizing the problem at hand, but methodological generalizations are difficult. In contrast, publicly available data make it easier to compare with other methods, so that the method's generalization can be claimed; hence, it is preferable to utilize such data for verification.
The real-world fab dataset WM-811K has frequently been used in wafer classification tasks via machine and deep learning 29 . For data representation, each wafer map is formed as a 2D image of varying size. As shown in Table 1, it has a highly imbalanced data distribution; for example, the near-full class accounts for only 0.1%. The appropriate data processing for the evaluation is addressed in "Experimental setup".
Experimental setup. To assess the effectiveness of our proposed method, we utilized the seven typical classes from WM-811K, as indicated in Fig. 3, and set balanced data distributions for each class. Previous studies on wafer map pattern classification using WM-811K can be classified into two categories. The first uses nine classes, while the second uses only seven or eight classes, depending on whether it contains the none or near-full classes. Mohamed et al. 30 highlighted the negative effects of using the none class, as it can impact both model training and performance analysis for several reasons. Thus, we followed the latter approach by taking seven classes, excluding the 'Near-full' and 'None' classes, to focus on addressing data scarcity separately from the data imbalance problem. Then, we sub-sampled train and test datasets for the seven classes, with small dataset sizes ranging from 100 to 6,400 and a balanced data size for each class. To preprocess the data, we first resized each wafer map to (64, 64) and then removed the wafer map background, retaining only the defect points. The background was removed because wafer maps vary in size, which can lead to slightly different shapes at the edges after resizing and thus negatively affect model training.
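The balanced sub-sampling step can be sketched as follows. This is a minimal illustration under assumptions: the label vector is synthetic, the per-class sample count is arbitrary, and the helper name `balanced_subsample` is hypothetical; it is not the authors' data pipeline.

```python
import numpy as np

def balanced_subsample(labels, n_per_class, seed=0):
    """Return sorted indices holding n_per_class randomly chosen samples per class."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(labels):
        candidates = np.flatnonzero(labels == cls)
        if candidates.size < n_per_class:
            raise ValueError(f"class {cls} has only {candidates.size} samples")
        chosen.append(rng.choice(candidates, size=n_per_class, replace=False))
    return np.sort(np.concatenate(chosen))

# Hypothetical imbalanced label vector with seven classes (0..6).
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(7), [500, 60, 300, 900, 200, 80, 40])
rng.shuffle(labels)

idx = balanced_subsample(labels, n_per_class=20)
counts = np.bincount(labels[idx], minlength=7)
```

Sampling without replacement inside each class keeps the subset balanced regardless of how skewed the full label distribution is, and fixing the seed makes repeated runs comparable.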
To comparatively evaluate the proposed model via an ablation study, we established four comparative models. The first, a baseline model, used the wafer map as input to the baseline network, as detailed in Table 2. The second, the Radon model, applied the Radon transformation to the wafer map before feeding it into the same baseline network. The third, the kernel flip model, had a two-branched kernel flip module within the baseline network and used the wafer map as input. Lastly, the proposed model incorporated both the Radon transformation and the kernel flip module into the baseline model, which is also detailed in Table 2.
In the experiments, the initial learning rate was set to 0.0003, and the Adam optimizer was used for updating the model weights. Learning rate decay was applied every epoch with a decay rate of 0.99. Training was stopped early when the validation loss did not decrease for 30 epochs to prevent overfitting. The loss function used was the cross-entropy loss, which is suitable for classification tasks. Each experiment was repeated 20 times using different random seeds. The results are reported as the average and standard deviation of all the repeated measurements.
Evaluation strategy. To evaluate the performance of our proposed method, we conducted both quantitative and qualitative analyses. Firstly, we performed a qualitative analysis using the LRP method to verify the adequacy of our proposed method. Specifically, we visually examined the LRP heat maps to analyze how the model focuses on different parts of the wafer map to make decisions. Additionally, we verified the effect of rotating and flipping the original wafer map on the proposed model's inference by assessing how these transformations affect the model's attention to the wafer map. Throughout these experiments, we compared the qualitative performance of the baseline and proposed methods. As the LRP heat map for the proposed method is based on Radon features, direct comparison with the baseline was difficult. Thus, we applied an inverse Radon transform to the relevance scores obtained from Radon feature-based inference, using the projection-slice theorem to verify the consistency between the original wafer map and Radon feature-based inference. This allowed us to compare the proposed method with the baseline. Secondly, we conducted a quantitative analysis to evaluate the performance of the proposed model.
Initially, we performed an ablation study to verify the validity of the proposed method by analyzing the effect of each module on both the overall and per-class performance. In addition, we assessed the impact of rotation and flip on the proposed model's performance for each class using the confusion matrix. The degree of variation under rotation and flip differs depending on the wafer map pattern, with some classes exhibiting insignificant variation while others display wide variation. For instance, the center and donut classes contain uniformly distributed defective points in all directions, resulting in insignificant variation under rotation and flip, while the scratch class has a wide variation under flip and rotation since it exists in curved or straight line forms independent of direction and location.
Lastly, to validate the generalization performance of our model, we conducted a thorough comparison of the performance of the proposed model and comparative models on the original test set and an unseen (out-of-distribution) augmented test set. Specifically, we evaluated the ability of the models to generalize to unseen distributions for rotation and flip transformations. While the original test set can be considered unseen as it was not used in training, it was still limited to the distribution within the original dataset. To assess the proposed model's robustness in generalization, we generated a dataset by directly rotating and flipping the test set to extend beyond the distribution of the original dataset. The rotation-augmented test set included 90°, 180°, and 270° rotationally augmented test sets, while the flip-augmented test set included horizontally and vertically flipped test sets. We then integrated the two augmentation methods for rotation and flip. It is important to note that the augmented test set did not include the original test set. This comparison allowed us to confirm the validity of the proposed model architecture and verify its robustness to unseen situations.
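The construction of the augmented test sets can be sketched as below. This is an illustrative sketch with assumptions: the images are synthetic, the function names are hypothetical, and the combined rotation-and-flip set is taken to be the union of the rotated and flipped copies, which is one possible reading of "integrated".

```python
import numpy as np

def rotation_augmented(images):
    """Copies rotated by 90, 180 and 270 degrees (original excluded)."""
    return {f"rot{90 * k}": np.rot90(images, k, axes=(1, 2)) for k in (1, 2, 3)}

def flip_augmented(images):
    """Horizontally and vertically flipped copies (original excluded)."""
    return {"hflip": images[:, :, ::-1], "vflip": images[:, ::-1, :]}

def rotation_flip_augmented(images):
    """Assumed combination: union of the rotation and flip variants."""
    return {**rotation_augmented(images), **flip_augmented(images)}

# Hypothetical test set: 4 wafer maps of size 64 x 64.
rng = np.random.default_rng(0)
test_images = (rng.random((4, 64, 64)) > 0.9).astype(np.uint8)

augmented = rotation_flip_augmented(test_images)
```

Because only exact multiples of 90 degrees and axis flips are used, no interpolation is involved and every augmented wafer map contains the same defect points as its source.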

Qualitative analysis. Radon transform-based classification.
To begin, we confirm how the model decision is made for label classification with the obtained LRP heat maps. Figure 4 compares the relevance scores of the baseline model and the proposed model for each class. By examining the second column, it is clear that the baseline model is primarily concerned with the visual pattern represented on the wafer map. Meanwhile, because the Radon model decision is difficult to interpret directly, it was compared using the relevance transformed by the inverse Radon transform, as depicted in the fifth column. As a result, it was determined that the proposed model also responds to the defect pattern on the wafer map. This is a significant finding because it demonstrates that the shape information contained in the wafer map is retained even when the model is evaluated solely based on the Radon feature. Moreover, by comparing the prediction outcomes, it is evident that the proposed model focuses on the primary defect location, which explains the higher classification performance.
In particular, the results show that for classes such as C3 and C7, the proposed model pays more attention to the location of clear patterns compared to the baseline. This observation is consistent with the fact that C3, C5, and C7 have a wide range of variations under rotation and flip transformations, making it difficult for the baseline model to learn class-discriminative features. In contrast, the proposed model shows robust learning with regard to rotation and flip transformations, which could be the reason behind the observed performance improvement. This finding provides evidence that the proposed method is effective in learning more robust and discriminative features in the presence of diverse image transformations, which can be especially useful for challenging real-world scenarios.
Rotation and flip invariant classification. Figure 5 compares the relevance scores of the baseline and proposed models, obtained by the multi-branch LRP method, while rotating and flipping the test set. The wafer map and Radon feature in rows 1-4 show that rotation of the wafer map acts as a translation of the Radon feature, and rows 5-8 demonstrate that vertical flipping of the wafer map acts as horizontal flipping of the Radon feature. Based on the LRP heat map obtained by the proposed model, the activated region is translated horizontally for the rotated wafer maps. Another notable point is that whenever the original wafer map is rotated and flipped, the relevance score of the baseline model attends to various different positions, whereas the proposed model focuses more on the defect points of the original wafer map. This indicates that the classification performance of the proposed model is highly robust to rotation and flip variations of the input wafer, which is also the reason why it shows improved classification performance for the original and augmented test sets, as discussed later in "Quantitative analysis".
Quantitative analysis. Classification performance comparison. Figure 6a and Table 3 present a comparison of the classification accuracy of the comparative models for various train set settings. The Radon and kernel flip models, as well as the proposed model, exhibit higher classification accuracy than the baseline model. Notably, the Radon model performs better than the kernel flip model, indicating that the wafer map patterns exhibit more variation under rotation than under flip. Of all the methods, the proposed model achieves the highest performance, indicating that invariance is ensured for both rotation and flip. Figure 6b-d presents a comparison of the baseline and proposed models in terms of class accuracy. Figure 6b shows the difference in class accuracy, which is the diagonal element of the confusion matrix (Fig. 6c, d).
Figure 6b indicates that the proposed model has a higher accuracy for all classes than the baseline model. In particular, the accuracies of C3 (edge-loc), C5 (loc), C6 (random), and C7 (scratch) increase significantly. This trend matches the fact that these classes have considerably more rotation and flip variance than the other classes. Therefore, it can be confirmed that the high accuracy of the proposed model is derived from the rotation and flip invariance.
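The class accuracy used in this comparison can be computed directly from a confusion matrix. The snippet below is a small sketch on made-up labels for two hypothetical models (`pred_a`, `pred_b`); the numbers are toy values chosen for illustration and are unrelated to the reported results.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def class_accuracy(cm):
    """Diagonal of the row-normalized confusion matrix."""
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for three classes and the predictions of two hypothetical models.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
pred_a = np.array([0, 1, 1, 1, 2, 0, 1, 2])
pred_b = np.array([0, 0, 1, 1, 2, 2, 1, 2])

acc_a = class_accuracy(confusion_matrix(y_true, pred_a, 3))
acc_b = class_accuracy(confusion_matrix(y_true, pred_b, 3))
gain = acc_b - acc_a   # per-class accuracy difference between the two models
```

Reading the per-class difference rather than the overall accuracy makes it visible which classes benefit from a change, which is the purpose of the per-class comparison in the text.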
Generalized classification performance for unseen rotated and flipped test set. Table 4 compares the classification accuracy of the comparative models for augmented test sets. In rows 1-2, the baseline and kernel flip models are evaluated on the flip-augmented test set. In rows 3-4, the baseline and Radon models are evaluated on the rotation-augmented test set. In rows 5-6, the baseline and proposed models are evaluated on the rotation- and flip-augmented test set. In all cases, the comparative models achieve higher accuracy than the baseline model. This means that the proposed model and its ablation models behave in a rotation- or flip-invariant manner in the unseen augmented situations for rotation or flip. Figure 7 shows the classification accuracy of the comparative models for the original and unseen augmented situations at a train set of size 6400. Figure 7a depicts the evaluation result for the original test set and flip-augmented test set of both the baseline and kernel flip models, Fig. 7b depicts the evaluation result for the rotation-augmented test set of both the baseline and Radon models, and Fig. 7c depicts the evaluation result for the rotation- and flip-augmented test set of both the baseline and proposed models. As illustrated in Fig. 7, the Radon, kernel flip, and proposed models all achieve increased accuracy over the baseline model on each augmented test set. However, in all three cases, the accuracy decreases slightly from the original to the augmented situation. It is noteworthy that the reduction for the baseline model is larger than that for the other comparative models. This can be interpreted as the proposed model having a higher resistance to degradation of generalization performance in unseen augmented situations. Figure 8 compares the generalization performance for each class between the proposed and baseline models on a train set of size 6400. Figure 8a shows the difference in the class accuracy of the baseline model between Fig. 8b (the original test set) and Fig. 8c (the rotated and flipped augmented test set). Figure 8d shows the difference in the class accuracy of the proposed model between Fig. 8e (the original test set) and Fig. 8f (the rotated and flipped augmented test set). Figure 8g shows the difference between Fig. 8d and Fig. 8a, which demonstrates that the proposed model has better generalization than the baseline model for each class. From Fig. 8d, we can see that the proposed model has a higher resistance to performance degradation in terms of generalization to an unseen augmented dataset for all the classes, while the classes C3 (edge-loc), C5 (loc), and C7 (scratch) show a significant increase. This strong generalization performance for rotation- and flip-sensitive classes demonstrates that the proposed model effectively preserves rotation and flip invariance. Additionally, this trend is in accordance with the findings for the original test set discussed in "Classification performance comparison".

Conclusion
In this paper, we introduce a novel method for achieving rotation and flip invariance in wafer map defect pattern classification, utilizing a combination of Radon transform and kernel flip techniques. The Radon feature ensures rotation invariance by transforming the rotation of the original wafer map into translation, while the kernel flipping approach provides flip invariance. Our proposed method employs an efficient network structure with a minimal number of flipped kernel branches by appropriately combining these two modules. We validate our model extensively using the WM-811K dataset with both qualitative and quantitative evaluations. Our proposed model's interpretability is demonstrated by verifying its decisions using the newly suggested multi-branch LRP method. The proposed model achieves high classification performance, even in limited data situations, by successfully ensuring rotation and flip invariance. Additionally, we assess the proposed method's generalization performance regarding rotation and flip invariance on out-of-distribution data by using rotation- and flip-augmented test sets. Our study provides an efficient end-to-end deep learning model that appropriately reflects the characteristics of wafer labeling and can serve as a suitable baseline for wafer diagnosis in the future.